ABSTRACT:

Recent linguistic theories cast surface complexity as the result of interacting subsys- 
tems of constraints. For instance, the ID/LP grammar formalism separates constraints on 
immediate dominance from those on linear order. Shieber (1983) has shown how to carry 
out direct parsing of ID/LP grammars. His algorithm uses ID and LP constraints directly 
in language processing, without expanding them into a context-free "object grammar." 
This report examines the computational difficulty of ID/LP parsing. Shieber's purported 
$O(|G|^2 \cdot n^3)$ runtime bound underestimates the difficulty of ID/LP parsing; the worst-case
runtime of his algorithm is exponential in grammar size. A reduction of the vertex-cover 
problem proves that ID/LP parsing is NP-complete. The growth of internal data struc- 
tures is the source of difficulty in Shieber's algorithm. The computational and linguistic 
implications of these results are discussed. Despite the potential for combinatorial explo- 
sion, Shieber's algorithm remains better than the alternative of parsing an expanded object 
grammar. 
1. Introduction

Under most recent linguistic theories, linguistic constraints fall into several subsystems 
each having its own character. Chomsky (1981:5), for instance, identifies the subtheories 
of bounding, government, θ-marking, binding, Case, and control, while Shieber (1983:2ff)
describes a version of Gazdar and Pullum's GPSG formalism that involves immediate- 
dominance rules, linear-order constraints, and metarules. When several independent con- 
straints are involved, a rule system that explicitly multiplies out their effects is large, 
cumbersome, and uninformative. 1 For example, as Shieber (:4) points out, the expanded 
context-free "object grammar" derived by multiplying out the constraints in a typical GPSG 
system would contain trillions of rules. 

Given the disadvantages of multiplying out the effects of separate systems of con- 
straints, Shieber's (1983) work leads in a welcome direction. Shieber considers how one 
might do parsing with ID/LP grammars, which involve two orthogonal kinds of rules. ID 
rules constrain immediate dominance irrespective of constituent order ("a sentence can be 
composed of V with NP and SBAR complements"), while LP rules constrain linear prece- 
dence among the daughters of any node ("if V and SBAR are sisters, then V must precede 
SBAR"). Shieber shows how Earley's (1970) algorithm for parsing context-free grammars
(CFGs) can be adapted to use the constraints of ID/LP grammars directly, without the 
combinatorially explosive step of converting the ID/LP grammar into standard context- 
free form. Instead of multiplying out all of the possible surface interactions among the 
ID and LP rules, Shieber's algorithm applies them one step at a time as needed. Surely 
this should work better in a parsing application than applying Earley's algorithm to an 
expanded grammar with trillions of rules, since the worst-case time complexity of Earley's 
algorithm is proportional to the square of the grammar size! 

Shieber's general approach is on the right track. On pain of having a large and cum- 
bersome rule system, the parser designer should first look to linguistics to find the correct 
set of constraints on syntactic structure, then discover how to apply some form of those 
constraints in parsing without multiplying out all possible surface manifestations of their 
effects. 

Nonetheless, nagging doubts about computational complexity remain. Although 
Shieber (1983:15) claims that his algorithm is identical to Earley's in time complexity, 
it seems almost too much to hope for that the size of an ID/LP grammar should enter into 
the time complexity of ID/LP parsing in exactly the same way that the size of a CFG enters 
into the time complexity of CFG parsing. An ID/LP grammar G can enjoy a huge size ad- 
vantage over a context-free grammar G' for the same language; for example, if G contains 
only the rule $S \rightarrow_{ID} abcde$, the corresponding G' contains 5! = 120 rules. In effect, the
claim that Shieber's algorithm has the same time complexity as Earley's algorithm means 
that this tremendously increased brevity of expression comes free (up to a constant). The 
paucity of supporting argument in Shieber's article does little to allay these doubts: 

We will not present a rigorous demonstration of time complexity, but it
should be clear from the close relation between the presented algorithm
and Earley's that the complexity is that of Earley's algorithm. In the
worst case, where the LP rules always specify a unique ordering for the
right-hand side of every ID rule, the presented algorithm reduces to
Earley's algorithm. Since, given the grammar, checking the LP rules takes
constant time, the time complexity of the presented algorithm is
identical to Earley's .... That is, it is $O(|G|^2 \cdot n^3)$, where $|G|$ is the size of the
grammar (number of ID rules) and n is the length of the input. (:14f)

1 See Barton (1984) for discussion.

Many questions remain; for example, why should a situation of maximal constraint represent 
the worst case, as Shieber claims? 2 

The following sections will investigate the complexity of ID/LP parsing in more detail. 
In brief, the outcome is that Shieber's direct-parsing algorithm usually does have a time 
advantage over the use of Earley's algorithm on the expanded CFG, but that it blows up in 
the worst case. The claim of $O(|G|^2 \cdot n^3)$ time complexity is mistaken; in fact, the worst-case
time complexity of ID/LP parsing cannot be bounded by any polynomial in the size of the
grammar and input, unless P = NP. ID/LP parsing is NP-complete.

As it turns out, the complexity of ID/LP parsing has its source in the immediate- 
domination rules rather than the linear precedence constraints. Consequently, the prece- 
dence constraints will be neglected. Attention will be focused on unordered context-free 
grammars (UCFGs), which are exactly like standard context-free grammars except that 
when a rule is used in a derivation, the symbols on its right-hand side are considered to 
be unordered and hence may be written in any order. UCFGs represent the special case of 
ID/LP grammars in which there are no LP constraints. Shieber's ID/LP algorithm can be 
used to parse UCFGs simply by ignoring all references to LP constraints.

2. Generalizing Earley's algorithm 

Shieber generalizes Earley's algorithm by modifying the progress datum that tracks 
progress through a rule. The Earley algorithm uses the position of a dot to track lin- 
ear advancement through an ordered sequence of constituents. The major predicates and 
operations on such dotted rules are these: 

• A dotted rule is initialized with the dot at the left edge, as in X → .ABC.

• A dotted rule is advanced across a terminal or nonterminal that was predicted and
has been located in the input by simply moving the dot to the right. For example,
X → A.BC is advanced across a B by moving the dot to obtain X → AB.C.

• A dotted rule is complete iff the dot is at the right edge. For example, X → ABC.
is complete.

• A dotted rule predicts a terminal or nonterminal iff the dot is immediately before
the terminal or nonterminal. For example, X → A.BC predicts B.

2 See section 5; it is in fact the best case.

UCFG rules differ from CFG rules only in that the right-hand sides represent unordered
multisets (that is, sets with repeated elements allowed). It is thus appropriate to use
successive accumulation of set elements in place of linear advancement through a sequence. In
essence, Shieber's algorithm replaces the standard operations on dotted rules with
corresponding operations on what will be called dotted UCFG rules: 3

• A dotted UCFG rule is initialized with the empty multiset before the dot and the
entire multiset of right-hand elements after the dot, as in X → {}.{A,B,C}.

• A dotted UCFG rule is advanced across a terminal or nonterminal that was
predicted and has been located in the input by simply moving one element from the
multiset after the dot to the multiset before the dot. For example, X → {A}.{B,C}
is advanced across a B by moving the B to obtain X → {A,B}.{C}. Similarly,
X → {A}.{B,C,C} may be advanced across a C to obtain X → {A,C}.{B,C}.

• A dotted UCFG rule is complete iff the multiset after the dot is empty. For example,
X → {A,B,C}.{} is complete.

• A dotted UCFG rule predicts a terminal or nonterminal iff the terminal or
nonterminal is a member of the multiset after the dot. For example, X → {A}.{B,C}
predicts B and C.

Given these replacements for operations on dotted rules, Shieber's algorithm operates in 
the same way as Earley's algorithm. As usual, each state in the parser's state sets consists 
of a dotted rule tracking progress through a constituent plus the interword position defining 
the constituent's left edge (Earley, 1970:95, omitting lookahead). The left-edge position is 
also referred to as the return pointer because of its role in the complete operation of the 
parser. 
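
The four operations on dotted UCFG rules can be made concrete with a short sketch. The class and method names below are hypothetical and chosen only for illustration; the representation (sorted tuples standing in for multisets) follows the multiset formulation used in this report rather than Shieber's own data structures, and return pointers are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DottedUCFGRule:
    """A dotted UCFG rule X -> {before}.{after} (illustrative names only)."""
    lhs: str
    before: tuple  # symbols already found, kept sorted so it acts as a multiset
    after: tuple   # symbols still expected, kept sorted so it acts as a multiset

    @staticmethod
    def initialize(lhs, rhs):
        # Empty multiset before the dot, the whole right-hand side after it.
        return DottedUCFGRule(lhs, (), tuple(sorted(rhs)))

    def predicts(self, symbol):
        # A symbol is predicted iff it is a member of the multiset after the dot.
        return symbol in self.after

    def advance(self, symbol):
        # Move exactly one occurrence of the symbol across the dot.
        if not self.predicts(symbol):
            raise ValueError(f"{symbol!r} is not predicted by this rule")
        remaining = list(self.after)
        remaining.remove(symbol)
        return DottedUCFGRule(
            self.lhs, tuple(sorted(self.before + (symbol,))), tuple(remaining)
        )

    def is_complete(self):
        # Complete iff nothing remains after the dot.
        return not self.after


rule = DottedUCFGRule.initialize("X", "ABCC").advance("A")
print(rule.advance("C"))  # X -> {A,C}.{B,C}
```

Because both multisets are stored in sorted order, two states that accumulate the same elements in different orders compare equal, which is what lets a single state stand for many orderings.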

3. The advantages of Shieber's algorithm 

The first question to ask is whether Shieber's algorithm saves anything. Is it faster to 
use Shieber's algorithm on a UCFG than to use Earley's algorithm on the corresponding 
expanded CFG? Consider the UCFG G1 that has only the single rule S → abcde. The
corresponding CFG G'1 has 120 rules spelling out all the permutations of abcde: S → abcde,
S → abced, and so forth. If the string abcde is parsed using Shieber's algorithm directly on
G1, the state sets of the parser remain small: 4

[S → {}.{a,b,c,d,e}, 0]
[S → {a}.{b,c,d,e}, 0]
[S → {a,b}.{c,d,e}, 0]
[S → {a,b,c}.{d,e}, 0]
[S → {a,b,c,d}.{e}, 0]
[S → {a,b,c,d,e}.{}, 0]



In contrast, consider what happens if the same string is parsed using Earley's algorithm on
the expanded CFG with its 120 rules. As Figure 1 illustrates, the state sets of the Earley
parser are much larger. In state set S1, the Earley parser uses 4! = 24 states to spell out
all the possible orders in which the remaining symbols {b, c, d, e} could appear. Shieber's
modified parser does not spell them out, but uses the single state [S → {a}.{b,c,d,e}, 0] to
summarize them all. Shieber's algorithm should thus be faster, since both parsers work by
successively processing all of the states in the state sets.

3 Shieber's representation differs in some ways from the representation used here, which was developed
independently by the author. The differences are generally inessential, but see note 5.

4 The states related to the auxiliary start symbol and endmarker that are added by some versions of the
Earley parser have been omitted for simplicity.

(a)  [S → {a}.{b,c,d,e}, 0]

(b)  [S → a.edcb, 0]   [S → a.ecbd, 0]   [S → a.decb, 0]   [S → a.cebd, 0]
     [S → a.ecdb, 0]   [S → a.ebcd, 0]   [S → a.cedb, 0]   [S → a.becd, 0]
     [S → a.dceb, 0]   [S → a.cbed, 0]   [S → a.cdeb, 0]   [S → a.bced, 0]
     [S → a.edbc, 0]   [S → a.dcbe, 0]   [S → a.debc, 0]   [S → a.cdbe, 0]
     [S → a.ebdc, 0]   [S → a.dbce, 0]   [S → a.bedc, 0]   [S → a.bdce, 0]
     [S → a.dbec, 0]   [S → a.cbde, 0]   [S → a.bdec, 0]   [S → a.bcde, 0]

Figure 1: The use of the Shieber parser on a UCFG can enjoy a large advantage over the
use of the Earley parser on the corresponding expanded CFG. After having processed the
terminal a while parsing the string abcde as discussed in the text, the Shieber parser uses
the single state shown in (a) to keep track of the same information for which the Earley
parser uses the 24 states in (b).

Similar examples show that the Shieber parser can enjoy an arbitrarily large advantage 
over the use of the Earley parser on the expanded CFG. Instead of multiplying out all surface 
appearances ahead of time to produce an expanded CFG, Shieber's algorithm works out 
the possibilities one step at a time, as needed. This can be an advantage because not all of 
the possibilities may arise with a particular input. 
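
The size difference in this example is easy to reproduce. The following sketch is not an Earley or Shieber parser; it only lists the dotted rules for S that survive after a given prefix of the input has been scanned, under the assumption (true for the single-rule grammar of this section) that all right-hand-side symbols are distinct terminals. The function names are hypothetical.

```python
from itertools import permutations


def expanded_rules(symbols):
    """Right-hand sides of the expanded CFG: one rule per permutation."""
    return ["".join(p) for p in permutations(symbols)]


def earley_states_after(prefix, symbols="abcde"):
    """Ordered dotted rules still alive once `prefix` has been scanned."""
    n = len(prefix)
    return [
        f"[S -> {rhs[:n]}.{rhs[n:]}, 0]"
        for rhs in expanded_rules(symbols)
        if rhs.startswith(prefix)
    ]


def shieber_state_after(prefix, symbols="abcde"):
    """The single dotted UCFG rule summarizing the same information."""
    seen = ",".join(sorted(prefix))
    rest = ",".join(s for s in symbols if s not in prefix)
    return f"[S -> {{{seen}}}.{{{rest}}}, 0]"


print(len(expanded_rules("abcde")))   # 120
print(len(earley_states_after("a")))  # 24
print(shieber_state_after("a"))       # [S -> {a}.{b,c,d,e}, 0]
```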



4. Combinatorial explosion with Shieber's algorithm 
The answer to the first question is yes, then: it can be more efficient to use Shieber's
parser than to use the Earley parser on an expanded "object grammar." The second question
to ask is whether Shieber's parser always enjoys a large advantage. Does the algorithm blow
up in difficult cases?

In the presence of lexical ambiguity, Shieber's algorithm can suffer from combinatorial
explosion. Consider the following UCFG, G2, in which x is five-ways ambiguous:

S → ABCDE
A → a | x
B → b | x
C → c | x
D → d | x
E → e | x

What happens if Shieber's algorithm is used to parse the string xxxxa according to this 
grammar? After the first three occurrences of x have been processed, the state set of 
Shieber's parser will reflect the possibility that any three of the phrases A, B, C, D, and E
might have been encountered in the input and any two of them might remain to be parsed.
There will be $\binom{5}{3} = 10$ states reflecting progress through the rule expanding S, in addition to
5 states reflecting phrase completion and 10 states reflecting phrase prediction (not shown):

S3: [S → {A,B,C}.{D,E}, 0]   [S → {A,B,D}.{C,E}, 0]
    [S → {A,C,D}.{B,E}, 0]   [S → {B,C,D}.{A,E}, 0]
    [S → {A,B,E}.{C,D}, 0]   [S → {A,C,E}.{B,D}, 0]
    [S → {B,C,E}.{A,D}, 0]   [S → {A,D,E}.{B,C}, 0]
    [S → {B,D,E}.{A,C}, 0]   [S → {C,D,E}.{A,B}, 0]

In cases like this, Shieber's algorithm enumerates all of the combinations of k elements taken
i at a time, where k is the rule length and i is the number of elements already processed.
Thus it can be combinatorially explosive.

It is important to note that even in this case, Shieber's algorithm wins out over parsing
the expanded CFG with Earley's algorithm. After the same input symbols have been
processed, the state set of the Earley parser will reflect the same possibilities as the state
set of the Shieber parser: any three of the required phrases might have been located, while
any two of them might remain to be parsed. However, the Earley parser has a less concise
representation to work with. In place of the state involving S → {A,B,C}.{D,E}, for
instance, there will be 3! · 2! = 12 states involving S → ABC.DE, S → BCA.ED, and so
forth. 5 Instead of a total of 25 states, the Earley state set will contain 135 = 12 · 10 + 15
states.
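
The two totals can be checked by enumeration. In this sketch the function names are hypothetical, and the figure of 15 additional states (5 completions plus 10 predictions) is simply taken from the count given in the text rather than derived by running a parser.

```python
from itertools import combinations
from math import factorial

OTHER_STATES = 15  # 5 completion + 10 prediction states, as counted in the text


def shieber_rule_states(daughters, i):
    """All dotted UCFG rules with i of the daughters before the dot."""
    return [
        (done, tuple(d for d in daughters if d not in done))
        for done in combinations(daughters, i)
    ]


def earley_rule_state_count(daughters, i):
    """Each multiset state fans out into i! * (k - i)! ordered dotted rules."""
    k = len(daughters)
    return len(shieber_rule_states(daughters, i)) * factorial(i) * factorial(k - i)


shieber_total = len(shieber_rule_states("ABCDE", 3)) + OTHER_STATES
earley_total = earley_rule_state_count("ABCDE", 3) + OTHER_STATES
print(shieber_total, earley_total)  # 25 135
```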

In the above case, although the parser could not be sure of the categorial identities of
the phrases parsed, at least there was no uncertainty about the number of phrases and their
extent. We can make matters even worse for the parser by introducing uncertainty in those
areas as well. Let G3 be the result of replacing every x in G2 with the empty string ε:

S → ABCDE
A → a | ε
B → b | ε
C → c | ε
D → d | ε
E → e | ε



5 In contrast to the representation illustrated here, Shieber's representation actually suffers to some extent
from the same problem. Shieber (1983:10) uses an ordered sequence instead of a multiset before the dot;
consequently, in place of the state involving S → {A,B,C}.{D,E}, Shieber would have the 3! = 6 states
involving S → α.{D,E}, where α ranges over the six permutations of ABC.
Then an A, for instance, can be either an a or nothing. Before any input has been read,
the first state set S0 in Shieber's parser must reflect the possibility that the correct parse
may include any of the $2^5 = 32$ possible subsets of A, B, C, D, and E as empty initial
constituents. For example, S0 must include [S → {A,B,C,D,E}.{}, 0] because the input
might turn out to be the null string. Similarly, it must include [S → {A,C,E}.{B,D}, 0]
because the input might turn out to be bd or db. Counting all possible subsets in addition to
other states having to do with predictions, completions, and the parser's start symbol, there
are 44 states in S0. (There are 338 states in the corresponding state set when the expanded
CFG G'3 is used.)

5. The source of the difficulty 

Why is Shieber's algorithm potentially exponential in grammar size despite its "close 
relation" to Earley's algorithm, which has time complexity polynomial in grammar size?
The answer lies in the size of the state space that each parser uses. Relative to grammar size, 
Shieber's algorithm involves a much larger bound than Earley's algorithm on the number 
of states in a state set. Since the main task of the Earley parser is to perform scan, predict, 
and complete operations on the states in each state set (Earley, 1970:97), an explosion in 
the size of the state sets will be fatal to any small runtime bound. 

Given a CFG $G_a$, how many possible dotted rules are there? Resulting from each rule
X → A1 ... Ak, there are k + 1 possible dotted rules. Then the number of possible dotted
rules is bounded by $|G_a|$, if this notation is taken to mean the number of symbols that it
takes to write $G_a$ down. An Earley state is a pair [r, i], where r is a dotted rule and i is
an interword position ranging from 0 to the length n of the input string. Because of these
limits, no state set in the Earley parser can contain more than $O(|G_a| \cdot n)$ (distinct) states.

The limited size of a state set allows an $O(|G_a|^2 \cdot n^3)$ bound to be placed on the
runtime of the Earley parser. Informally, the argument (due to Earley) runs as follows.
The scan operation on a state can be done in constant time; the scan operations in a
state set thus contribute no more than $O(|G_a| \cdot n)$ computational steps. All of the predict
operations in a state set taken together can add no more states than the number of rules
in the grammar, bounded by $|G_a|$, since a nonterminal needs to be expanded only once in
a state set regardless of how many times it is predicted; hence the predict operations need
not take more than $O(|G_a| \cdot n + |G_a|) = O(|G_a| \cdot n)$ steps. Finally, there are the complete
operations to be considered. A given completion can do no worse than advancing every
state in the state set indicated by the return pointer. Therefore, k completions require at
most $k^2$ steps; the complete operations in a state set can take no more than $O(|G_a|^2 \cdot n^2)$
steps. Overall, then, it takes no more than $O(|G_a|^2 \cdot n^2)$ steps to process one state set and
no more than $O(|G_a|^2 \cdot n^3)$ steps for the Earley parser to process them all.

In Shieber's parser, though, the state sets can grow much larger relative to grammar
size. Given a UCFG $G_b$, how many possible dotted UCFG rules are there? Resulting from
a rule X → A1 ... Ak, there are not k + 1 possible dotted rules tracking linear advancement,
but $2^k$ possible dotted UCFG rules tracking accumulation of set elements. In the worst
case, the grammar contains only one rule and k is on the order of $|G_b|$; hence the number
of possible dotted UCFG rules for the whole grammar is not bounded by $|G_b|$, but by $2^{|G_b|}$.
(Recall the exponential blowup demonstrated for grammar G3 in section 4.)

Figure 2: This graph illustrates a trivial instance of the vertex cover problem. The set
{c, d} is a vertex cover of size 2.

Informally speaking, the reason why Shieber's parser sometimes suffers from combi- 
natorial explosion is that there are exponentially more possible ways to progress through 
an unordered rule expansion than an ordered one. When disambiguating information is 
scarce, the parser must keep track of all of them. In the more general task of parsing 
ID/LP grammars, the most tractable case occurs when constraint from the LP relation is
strong enough to force a unique ordering for every rule expansion. Under such conditions, 
Shieber's parser reduces to Earley's. However, the case of strong constraint represents the 
best case computationally, rather than the worst case as Shieber (1983:14) claims. 
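
The contrast between k + 1 ordered dotted rules and 2^k unordered ones can be tabulated directly. The sketch below uses hypothetical function names. It also counts the case of repeated right-hand-side symbols, where the number of distinct sub-multisets is the product of (multiplicity + 1) over the symbols; that refinement is an observation added here for illustration and is not part of the argument in the text, which considers the worst case of k distinct symbols.

```python
from collections import Counter
from math import prod
from string import ascii_uppercase


def dotted_rule_count(rhs):
    """Ordered rule: one dotted rule per dot position."""
    return len(rhs) + 1


def dotted_ucfg_rule_count(rhs):
    """Unordered rule: one dotted rule per distinct sub-multiset of the RHS."""
    return prod(n + 1 for n in Counter(rhs).values())


# With k distinct symbols the unordered count is exactly 2**k.
for k in range(1, 11):
    rhs = ascii_uppercase[:k]
    print(k, dotted_rule_count(rhs), dotted_ucfg_rule_count(rhs))
```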



6. ID/LP parsing is inherently difficult 

The worst-case time complexity of Shieber's algorithm is exponential in grammar size 
rather than quadratic as Shieber (1983:15) believed. Did Shieber simply choose a poor 
algorithm, or is ID/LP parsing inherently difficult in the general case? In fact, the simpler 
problem of recognizing sentences according to a UCFG is NP-complete. 6 Consequently,
unless P = NP, no algorithm for ID/LP parsing can have a runtime bound that is polynomial
in the size of the grammar and input.

The proof of NP-completeness involves reducing the vertex cover problem (Garey 
and Johnson, 1979:46) to the UCFG recognition problem. Through careful construction 
of the grammar and input string, it is possible to "trick" the parser into solving a known 
hard problem. The vertex cover problem involves finding a small set of vertices in a graph 
with the property that every edge of the graph has at least one endpoint in the set. Figure 2 
shows a trivial example. 

To construct a grammar that encodes the question of whether the graph in Figure 2
has a vertex cover of size 2, first take the vertex names a, b, c, and d as the alphabet. Take
START as the start symbol. Take H1 through H4 as special symbols, one per edge; also
take U and D as special dummy symbols.

6 Recognition is simpler than parsing because a recognizer is not required to recover the structure of an input
string, but only to decide whether the string is in the language generated by the grammar: that is, whether
or not there exists a parse.

START → H1 H2 H3 H4 U U D D D D
H1 → a | c
H2 → b | c
H3 → c | d
H4 → b | d
U → aaaa | bbbb | cccc | dddd
D → a | b | c | d

Figure 3: For k = 2, the construction described in the text transforms the vertex-cover
problem of Figure 2 into this UCFG. A parse exists for the string aaaabbbbccccdddd iff the
graph in the previous figure has a vertex cover of size ≤ 2.

Next, write the rules corresponding to the edges of the graph. Edge e1 runs from a
to c, so include the rules H1 → a and H1 → c. Encode the other edges similarly. Rules
expanding the dummy symbols are also needed. Dummy symbol D will be used to soak up
excess input symbols, so D → a through D → d should be rules. Dummy symbol U will
also be used to soak up excess input symbols, but U will be allowed to match only when
there are four occurrences in a row of the same symbol (one occurrence for each edge). Take
U → aaaa, U → bbbb, U → cccc, and U → dddd as the rules expanding U.

Now, what does it take for the graph to have a vertex cover of size k = 2? One way
to get a vertex cover is to go through the list of edges and underline one endpoint of each 
edge. If the vertex cover is to be of size 2, the underlining must be done in such a way that 
only two distinct vertices are ever touched in the process. Alternatively, since there are 4 
vertices in all, the vertex cover will be of size 2 if there are 4 — 2 = 2 vertices left untouched 
in the underlining process. This method of finding a vertex cover can be translated into a 
UCFG rule as follows: 

START → H1 H2 H3 H4 U U D D D D

That is, each H-symbol is supposed to match the name of one of the endpoints of the
corresponding edge, in accordance with the rules expanding the H-symbols. Each U-symbol
is supposed to correspond to a vertex that was left untouched by the H-matching, and the
D-symbols are just there for bookkeeping. Figure 3 lists the complete grammar that encodes 
the vertex-cover problem of Figure 2. 

To make all of this work properly, take 

σ = aaaabbbbccccdddd

as the input string to be parsed. (In general, for every vertex name x, include in σ a
contiguous run of occurrences of x, one occurrence for each edge in the graph.) The grammar
encodes the underlining procedure by requiring each H-symbol to match one of its endpoints
in σ. Since the right-hand side of the START rule is unordered, the grammar allows an
H-symbol to match anywhere in the input, hence to match any vertex name (subject to
interference from other rules that have already matched). Furthermore, since there is one
occurrence of each vertex name for every edge, all of the edges could conceivably be matched
up with the same vertex; that is, it's impossible to run out of vertex-name occurrences.
Consequently, the grammar will allow either endpoint of an edge to be "underlined." The
parser will have to figure out which endpoints to choose; in other words, which vertex cover
to select. However, the grammar also requires two occurrences of U to match somewhere.
U can only match four contiguous identical input symbols that have not been matched in
any other way, and thus if the parser chooses a vertex cover that is too large, the U-symbols
will not match and the parse will fail. The proper number of D-symbols is given by the
length of the input string, minus the number of edges in the graph (to account for the
H-matches), minus the number of U-symbols times the number of edges (to account for the
U-matches): in this case, 16 - 4 - (2 · 4) = 4, as illustrated in the START rule.

The net result of this construction is that in order to decide whether σ is in the language
generated by the UCFG, the parser must in effect search for a vertex cover of size 2 or less. 7
If a parse exists, an appropriate vertex cover can be read off from beneath the H-symbols in
the parse tree; conversely, if an appropriate vertex cover exists, it indicates how to construct
a parse. Figure 4 shows the parse tree that encodes a solution to the vertex-cover problem
of Figure 2.

The construction shows that the vertex-cover problem is reducible to UCFG recognition.
Furthermore, the construction of the grammar and input string can be carried out in poly- 
nomial time. Consequently, UCFG recognition and the more general task of ID/LP parsing 
must be computationally difficult. For a more careful and detailed treatment of the reduc- 
tion and its correctness, see the appendix. 
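
The construction can be sketched in a few lines of code. Everything named below is hypothetical and introduced only for illustration. The sketch assumes single-character vertex names, takes the number of U-symbols to be the number of vertices minus k (as in the "4 - 2 = 2 vertices left untouched" step above), and handles only grammars of the flat shape produced by the reduction. The recognizer is a memoized brute-force search over (input position, remaining multiset) pairs; it is not Shieber's algorithm, and it can take exponential time in general.

```python
from collections import Counter
from functools import lru_cache
from itertools import combinations


def build_reduction(vertices, edges, k):
    """Return (START right-hand side, rules, input string) for a cover of size <= k."""
    if not 1 <= k <= len(vertices):
        raise ValueError("this sketch assumes 1 <= k <= number of vertices")
    m = len(edges)
    rules = {f"H{i}": [u, v] for i, (u, v) in enumerate(edges, 1)}
    rules["U"] = [v * m for v in vertices]  # a full run of one vertex name
    rules["D"] = list(vertices)             # soaks up any single symbol
    sigma = "".join(v * m for v in vertices)
    n_u = len(vertices) - k                 # vertices left untouched
    n_d = len(sigma) - m - n_u * m          # bookkeeping symbols
    start = [f"H{i}" for i in range(1, m + 1)] + ["U"] * n_u + ["D"] * n_d
    return start, rules, sigma


def recognizes(start, rules, sigma):
    """Decide whether the unordered START rule can derive sigma."""
    symbols = sorted(set(start))
    counts = Counter(start)

    @lru_cache(maxsize=None)
    def go(pos, remaining):
        if pos == len(sigma):
            return not any(remaining)
        for idx, sym in enumerate(symbols):
            if remaining[idx] == 0:
                continue
            rest = remaining[:idx] + (remaining[idx] - 1,) + remaining[idx + 1:]
            for rhs in rules[sym]:
                if sigma.startswith(rhs, pos) and go(pos + len(rhs), rest):
                    return True
        return False

    return go(0, tuple(counts[s] for s in symbols))


def has_vertex_cover(vertices, edges, k):
    """Direct brute-force check, used only to compare against the grammar."""
    return any(
        all(u in cover or v in cover for u, v in edges)
        for size in range(k + 1)
        for cover in combinations(vertices, size)
    )


# The graph of Figure 2: edges a-c, b-c, c-d, b-d.
V, E = "abcd", [("a", "c"), ("b", "c"), ("c", "d"), ("b", "d")]
print(recognizes(*build_reduction(V, E, 2)), has_vertex_cover(V, E, 2))
```

For the graph of Figure 2 with k = 2, `build_reduction` yields the grammar of Figure 3 and the input string aaaabbbbccccdddd, and the two checks agree.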

7. Computational implications 

The reduction of Vertex Cover shows that the ID/LP parsing problem is NP-complete. 
Unless P = NP, the time complexity of ID/LP parsing cannot be bounded by any
polynomial in the size of the grammar and input. 8 An immediate conclusion is that complexity
analysis must be done carefully: despite its similarity to Earley's algorithm, Shieber's
algorithm does not have complexity $O(|G|^2 \cdot n^3)$. For some choices of grammar and input, its
internal structures undergo exponential growth. Other consequences also follow.

7.1. Parsing the object grammar 

Even in the face of its combinatorially explosive worst-case behavior, Shieber's
algorithm should not be immediately cast aside. Despite the fact that it sometimes blows up,
it still has an advantage over the alternative of parsing the expanded "object grammar."
One interpretation of the NP-completeness result is that the general case of ID/LP parsing
is inherently difficult; hence it should not be surprising that Shieber's algorithm for solving
that problem can sometimes suffer from combinatorial explosion. More significant is the fact
that parsing with the expanded CFG blows up in cases that should not be difficult. There
is nothing inherently difficult about parsing the language that consists of all permutations
of the string abcde, but while parsing that language the Earley parser can use 24 states or
more to encode what the Shieber parser encodes in only one (§3). To put the point another
way, the significant fact is not that the Shieber parser can blow up; it is that the use of an
expanded CFG blows up unnecessarily.

7 If the vertex cover is smaller than expected, the D-symbols will soak up the extra contiguous runs that
could have been matched by more U-symbols.

8 Even assuming P ≠ NP, it does not follow that the time complexity must be exponential, though it seems
likely to be. There are functions such as $n^{\log n}$ that fall between polynomials and exponentials. See
Hopcroft and Ullman (1979:341).

Figure 4: The grammar of Figure 3, which encodes the vertex-cover problem of Figure 2,
generates the string σ = aaaabbbbccccdddd according to this parse tree. The vertex cover
{c, d} can be read off from the parse tree as the set of elements dominated by H-symbols.
7.2. Is precompilation possible?

The present reduction of Vertex Cover to ID/LP Parsing involves constructing a gram- 
mar and input string that both depend on the problem to be solved. Consequently, the 
reduction does not rule out the possibility that through clever programming one might 
concentrate most of the computational difficulty of ID/LP parsing into a separate precom- 
pilation stage, dependent on the grammar but independent of the input. According to this 
optimistic scenario, the entire procedure of preprocessing the grammar and parsing the in- 
put string would be as difficult as any NP-complete problem, but after precompilation, the 
time required for parsing a particular input would be bounded by a polynomial in grammar 
size and sentence length.

Regarding the case immediately at hand, Shieber's modified Earley algorithm has no 
precompilation step. 9 The complexity result implied by the reduction thus applies with 
full force; any possible precompilation phase has yet to be proposed. Moreover, it is by no 
means clear that a clever precompilation step is even possible; it depends on exactly how 
\G\ and n enter into the complexity function for ID/LP parsing. If n enters as a factor 
multiplying an exponential, precompilation cannot help enough to ensure that the parsing
phase will run in polynomial time. 

For example, suppose some parsing problem is known to require $2^{|G|} \cdot n^3$ steps for
solution. 10 If one is willing to spend, say, $10 \cdot 2^{|G|}$ steps in the precompilation phase, is it
possible to reduce parsing-phase complexity to something like $|G| \cdot n^3$? The answer is no.
Since by hypothesis it takes at least $2^{|G|} \cdot n^3$ steps to solve the problem, there must be at
least $2^{|G|} \cdot n^3 - 10 \cdot 2^{|G|}$ steps left to perform after the precompilation phase. The parameter
n is necessarily absent from the precompilation complexity, hence the term $2^{|G|} \cdot n^3$ will
eventually dominate.
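
The arithmetic behind this example can be written out explicitly. The display below only restates the hypothetical figures used in the example (a total of $2^{|G|} \cdot n^3$ steps and a precompilation budget of $10 \cdot 2^{|G|}$ steps); it is not a claim about the actual complexity of ID/LP parsing.

```latex
\[
  2^{|G|} \cdot n^3 \;-\; 10 \cdot 2^{|G|}
  \;=\; 2^{|G|}\,(n^3 - 10)
  \;\ge\; \tfrac{1}{2}\, 2^{|G|} \cdot n^3
  \qquad \text{for } n \ge 3,
\]
```

since $n^3 - 10 \ge n^3/2$ exactly when $n^3 \ge 20$. The work remaining after precompilation is therefore within a constant factor of the original requirement, and it is still exponential in $|G|$.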

In a related vein, suppose the precompilation step is conversion from ID/LP to CFG 
form and the runtime step is the use of the Earley parser on the expanded CFG. Although 
the precompilation step does a potentially exponential amount of work in producing G' 
from G, another exponential factor still shows up at runtime because $|G'|$ in the complexity
bound $|G'|^2 \cdot n^3$ is exponentially larger than the original $|G|$.

7.3. Polynomial-time parsing of a fixed grammar 

As noted above, both grammar and input in the current vertex-cover reduction
depend on the vertex-cover problem to be solved. The NP-completeness result would be
strengthened if there were a reduction that used the same fixed grammar for all
vertex-cover problems, for it would then be possible to prove that a precompilation phase would
be of little avail. However, unless P = NP, it is impossible to design such a reduction. Since
grammar size is not considered to be a parameter of a fixed-grammar parsing problem, the
use of the Earley parser on the object grammar constitutes a polynomial-time algorithm for
solving the fixed-grammar ID/LP parsing problem.

Although ID/LP parsing for a fixed grammar can thus be done in cubic time, that fact
has little practical significance. The object grammar $G'$ corresponding to a practical ID/LP
grammar would be huge, and if $|G'| \cdot n^3$ complexity is too slow, then it remains too slow
when $|G'|$ is regarded as a constant.

7.4. The power of the UCFG formalism 

The vertex-cover reduction also helps pin down the computational power of the UCFG
formalism. As $G_1$ and $G_1'$ in section 3 illustrated, a UCFG (or an ID/LP grammar) can enjoy
considerable brevity of expression compared to the equivalent CFG. The NP-completeness
result illuminates this property in two ways. First, the result shows that this brevity of
expression is sufficient to allow an instance of any problem in NP to be stated in a UCFG
that is only polynomially larger than the original problem instance. In contrast, if an
attempt is made to replicate the current reduction with a CFG rather than a UCFG, the
necessity of spelling out all the orders in which the H-, U-, and D-symbols might appear
makes the CFG more than polynomially larger than the problem instance. Consequently,
the reduction fails to establish NP-completeness, which indeed does not hold. Second,
the result shows that the increased expressive power does not come free; while the CFG
recognition problem can be solved in cubic time or less, 11 unless P = NP the general UCFG
recognition problem cannot be solved in polynomial time.

9 Shieber (1983:15 n. 0) mentions a possible precompilation step, but it is concerned with the LP relation
rather than the ID rules.

10 It is not known whether the worst-case complexity of ID/LP parsing is exponential, since more generally
it is not known for sure that P $\neq$ NP.

The details of the reduction also help pin down how powerful a single UCFG rule can 
be. If the UCFG formalism is extended to permit ordinary CFG rules in addition to rules 
with unordered expansions, the grammar that expresses a vertex-cover problem needs only 
one UCFG rule, although that rule may need to be arbitrarily long. 

7.5. The role of constraint 

Finally, the discussion of section 5 illustrates the way in which the weakening of constraints
can often make a problem computationally more difficult. It might erroneously be
thought that weak constraints represent the best case in computational terms, for "weak"
constraints sound easy to verify. However, oftentimes the weakening of constraint multiplies
the number of possibilities that must be considered in the course of solving a problem. In
the case at hand, the removal of constraints on the order in which constituents can appear
causes the dependence of parsing complexity on grammar size to grow from $|G|$ to $2^{|G|}$.
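The growth from a linear to an exponential dependence can be illustrated by counting the distinct "progress" states a single rule can be in while it is being matched. This is only a sketch under stated assumptions: the right-hand side is duplicate-free, there are no order constraints at all, and the two functions are hypothetical stand-ins rather than the actual data structures of the Earley or Shieber algorithms.

```python
from itertools import combinations


def ordered_progress_states(rhs):
    """For an ordered CFG rule, progress is a dot position:
    one state per prefix of the right-hand side."""
    return [tuple(rhs[:i]) for i in range(len(rhs) + 1)]


def unordered_progress_states(rhs):
    """For an unordered rule with no order constraints, progress is the set of
    symbols matched so far: one state per subset of the right-hand side."""
    return [frozenset(c) for r in range(len(rhs) + 1) for c in combinations(rhs, r)]


if __name__ == "__main__":
    rhs = ("a", "b", "c", "d", "e")
    print(len(ordered_progress_states(rhs)))    # len(rhs) + 1
    print(len(unordered_progress_states(rhs)))  # 2 ** len(rhs)
```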

8. Linguistic implications 

Significantly, the key ingredients that can cause difficulties for the ID/LP parsing al- 
gorithm are not exotically foreign to linguistic theory. Most current formalisms (e.g. GB- 
theory and GPSG) permit the existence of constituents that are empty on the surface; hence 
in principle they permit the kind of pathological case illustrated by Gz in section 4, subject 
to amelioration by additional constraints. Similarly, a key ingredient of the vertex-cover 
reduction is lexical ambiguity — acknowledged by every current theory. 

Nonetheless, the implications of the NP-completeness result for grammatical theory
are fewer than they might seem. The reduction contributes to the necessary goal of
understanding the computational power of various mechanisms and formal devices, but it does
not (for instance) rule out the use of formalisms that decouple constraints on immediate
dominance from constraints on linear precedence.

Under the assumption that natural languages are efficiently parsable, computational
difficulties in parsing a formalism do indicate that the formalism itself does not tell the
whole story. That is, they point out that the range of possible languages has been
incorrectly characterized: the additional constraints that guarantee efficient parsability remain
unstated. Since the general case of parsing ID/LP grammars is computationally difficult, if
the linguistically relevant ID/LP grammars are to be efficiently parsable, there must be
additional factors that guarantee, say, a certain amount of constraint from the LP relation. 12
(Constraints beyond the bare ID/LP formalism are required on linguistic grounds as well.)
Note that the subset principle of language acquisition (cf. Berwick and Weinberg, 1984:233)
would lead the language learner to initially hypothesize strong order constraints, to be
weakened only in response to positive evidence.

11 Since $O(|G|^2 \cdot n^3) < O((|G| + n)^3)$, the complexity of Earley's algorithm is no worse than cubic in the
combined length of grammar and input.

However, there are other potential ways to guarantee efficient parsability. It might turn
out that the principles and parameters of the best grammatical theory permit languages that
are not efficiently parsable in the worst case, just as grammatical theory permits sentences
that are deeply center-embedded (Miller and Chomsky, 1963). 13 In such a situation, difficult
languages or sentences would not be expected to turn up in general use, precisely because
they would be difficult to process. 14 The factors that guarantee efficient parsability would
not be part of grammatical theory because they would result from extragrammatical factors,
i.e. the resource limitations of the language-processing mechanisms. This "easy way out"
is not automatically available, depending as it does on a detailed account of processing
mechanisms. For example, in the Earley parser, the difficulty of parsing a construction
can vary widely with the amount of lookahead used (if any). Like any other theory, an
explanation based on resource limitations must make the right predictions about which
constructions will be difficult to parse.

In the same way, the language-acquisition procedure could potentially be the source of 
some constraints relevant to efficient parsability. Perhaps not all of the languages permitted 
by the principles and parameters of syntactic theory are accessible in the sense that they 
can potentially be constructed by the language-acquisition component. It is to be expected 
that language-acquisition mechanisms will be subject to various kinds of limitations just 
as all other mental mechanisms are. Again, however, concrete conclusions must await a 
detailed proposal. 




12 In the GB-framework of Chomsky (1981), for instance, the syntactic expression of unordered θ-grids at the
X level is constrained by the principles of Case theory. Endocentricity is another significant constraint. See
also Berwick's (1982) discussion of constraints that could be placed on another grammatical formalism,
lexical-functional grammar, to avoid a similar intractability result.

13 Indeed, one may not conclude a priori that the languages permitted by linguistic theory are parsable at all 
(Chomsky, 1980). 

14 It is often anecdotally remarked that languages that allow relatively free word order tend to make heavy 
use of inflections. A rich inflectional system can supply parsing constraints that make up for the lack 
of ordering constraints; thus the situation we do not find is the computationally difficult case of weak 
constraint.

9. Appendix

This appendix contains the details of a careful reduction of the vertex-cover problem to 
the UCFG recognition problem. This version of the reduction establishes that the difficulty 
of UCFG recognition is not due either to the possibility of empty constituents (e-rules) or 
to the possibility of repeated symbols in rules (i.e. to the use of multisets rather than sets). 
Consequently, it is somewhat different from and more complex than the one sketched in the 
text. 

9.1. Defining unordered context-free grammars 

Definition: An unordered CFG (UCFG) is a quadruple $(N, \Sigma, R, S)$, where:

(a) $N$ is a finite set of nonterminals.

(b) $\Sigma$, disjoint from $N$, is a finite, nonempty set of terminal symbols.

(c) $R$ is a nonempty set of rules $(A, \alpha)$, where $A \in N$ and $\alpha \in (N \cup \Sigma)^*$. The rule
$(A, \alpha)$ may be written as $A \to \alpha$.

(d) $S \in N$ is the start symbol.

Convention: The grammar $G$ and its components $N, \Sigma, R, S$ need not be explicitly
mentioned when clear from context.

Convention: Unless otherwise noted,

(a) $A, A', A_i, \ldots$ denote elements of $N$;

(b) $a, a', a_i, \ldots$ denote elements of $\Sigma$;

(c) $X, Y, X', Y', X_i, Y_i, \ldots$ denote elements of $N \cup \Sigma$;

(d) $\sigma, u, u', u_i, \ldots$ denote elements of $\Sigma^*$;

(e) $\alpha, \beta, \gamma, \varphi, \psi$ denote elements of $(N \cup \Sigma)^*$.

Definition: $G = (N, \Sigma, R, S)$ is ε-free iff for every $(A, \alpha) \in R$, $|\alpha| \neq 0$.

Definition: $G = (N, \Sigma, R, S)$ is branching iff for some $(A, \alpha) \in R$, $|\alpha| > 1$.

Definition: $G = (N, \Sigma, R, S)$ is duplicate-free iff for every $(A, \alpha) \in R$, $\alpha = Y_1 \ldots Y_n$ and
for all $i, j \in [1, n]$, $Y_i = Y_j$ iff $i = j$.

Definition: $G = (N, \Sigma, R, S)$ is simple iff it is ε-free, duplicate-free, and branching.

Note. The notion of a simple UCFG is introduced in order to help pin down the source of 
any computational difficulties associated with UCFGs. For example, since simple UCFGs 
are restricted to be duplicate-free, a difficulty that arises with simple UCFGs cannot result 
from the possibility that a symbol may occur more than once on the right-hand side of a 
rule. 

Definition: $\varphi A \psi \Rightarrow_G \varphi \alpha \psi$ (by $r$) just in case for some $r = (A', Y_1 \ldots Y_n) \in R$ and
for some permutation $\rho$ of $[1, n]$, $A = A'$ and $\alpha = Y_{\rho(1)} \ldots Y_{\rho(n)}$. If $\varphi \in \Sigma^*$, also write
$\varphi A \psi \Rightarrow_{lm} \varphi \alpha \psi$ (a leftmost step in $G$).

Definition: $L(G) = \{\sigma \in \Sigma^* : S \Rightarrow^* \sigma\}$.

$V = \{v, w, x, y, z\}$
$E = \{e_1, e_2, e_3, e_4, e_5, e_6, e_7\}$
with the $e_i$ as indicated
$k = 3$

Figure 5: The triple $(V, E, k)$ is an instance of VERTEX COVER. The set $V' = \{v, x, z\}$ is
a vertex cover of size $k = 3$.



Definition: An $n$-step derivation of $\psi$ from $\varphi$ is a sequence $(\varphi_0, \ldots, \varphi_n)$ such that $\varphi_0 = \varphi$,
$\varphi_n = \psi$, and for all $i \in [0, n-1]$, $\varphi_i \Rightarrow \varphi_{i+1}$. If it is also true for all $i$ that $\varphi_i \Rightarrow_{lm} \varphi_{i+1}$,
say that the derivation is leftmost.

9.2. Defining the computational problems

Definition: A possible instance of the problem VERTEX COVER is a triple $(V, E, k)$,
where $(V, E)$ is a finite graph with at least one edge and at least two vertices, $k \in \mathbb{N}$, and
$k < |V|$. 15 VERTEX COVER itself consists of all possible instances $(V, E, k)$ such that for
some $V' \subseteq V$, $|V'| \leq k$ and for all edges $e \in E$, at least one endpoint of $e$ is in $V'$. (Figure 5
gives an example of a VERTEX COVER instance.)
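The membership condition in this definition can be stated as a brute-force check. The sketch below is only an illustration of the definition; the function name and the small example graphs are assumptions, and the exhaustive search is of course exponential in the worst case.

```python
from itertools import combinations


def has_vertex_cover(vertices, edges, k):
    """Return True iff some V' within `vertices` with |V'| <= k contains at
    least one endpoint of every edge. Edges are given as pairs of vertices."""
    for size in range(k + 1):
        for subset in combinations(sorted(vertices), size):
            chosen = set(subset)
            if all(u in chosen or v in chosen for u, v in edges):
                return True
    return False


if __name__ == "__main__":
    # Hypothetical instances: a path a-b-c and a triangle on a, b, c.
    path = [("a", "b"), ("b", "c")]
    triangle = [("a", "b"), ("b", "c"), ("a", "c")]
    print(has_vertex_cover({"a", "b", "c"}, path, 1))
    print(has_vertex_cover({"a", "b", "c"}, triangle, 1))
    print(has_vertex_cover({"a", "b", "c"}, triangle, 2))
```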

Fact: VERTEX COVER is NP-complete. (Garey and Johnson, 1979:46) 

Definition: A possible instance of the problem SIMPLE UCFG RECOGNITION is a pair
$(G, \sigma)$, where $G$ is a simple UCFG and $\sigma \in \Sigma^*$. SIMPLE UCFG RECOGNITION itself
consists of all possible instances $(G, \sigma)$ such that $\sigma \in L(G)$.

Notation: Take $\|\cdot\|$ to be any reasonable measure of the encoded input length for a
computational problem; continue to use $|\cdot|$ for set cardinality and string length. It is reasonable
to require that if $S$ is a set, $k \in \mathbb{N}$, and $|S| > k$, then $\|S\| \geq \|k\|$; that is, the encoding of
numbers is better than unary. It is also reasonable to require that $\|(\ldots, x, \ldots)\| \geq \|x\|$.

15 This formulation differs trivially from the one cited by Garey and Johnson.

9.3. The UCFG recognition problem is in NP 

Lemma 9.1: Let $(\varphi_0, \ldots, \varphi_k)$ be a shortest leftmost derivation of $\varphi_k$ from $\varphi_0$ in a
branching ε-free UCFG. If $k > |N| + 1$ then $|\varphi_k| > |\varphi_0|$.

Proof. There exists some sequence of rules $(A_0, \alpha_0), \ldots, (A_{k-1}, \alpha_{k-1})$ such that for all
$i \in [0, k-1]$, $\varphi_i \Rightarrow_{lm} \varphi_{i+1}$ by $(A_i, \alpha_i)$. Since $G$ is ε-free, $|\varphi_{i+1}| \geq |\varphi_i|$ always.

Case 1. For some $i$, $|\alpha_i| > 1$. Then $|\varphi_{i+1}| > |\varphi_i|$. Hence $|\varphi_k| > |\varphi_0|$.

Case 2. For every $i$, $|\alpha_i| = 1$. Then there exist $u, \gamma$ such that for every $i \in [0, k-2]$, there
is $A'_i \in N$ such that $\varphi_{i+1} = u A'_i \gamma$. Suppose the $A'_i$ are all distinct. Then $|N| \geq k - 1$,
hence $|N| + 1 \geq k$, hence $|N| + 1 > |N| + 1$, which is impossible. Hence for some $i, j \in
[0, k-2]$, $i < j$, $A'_i = A'_j$. Hence $\varphi_{i+1} = \varphi_{j+1}$, since $[1, 1]$ has only one permutation. Then
$(\varphi_0, \ldots, \varphi_i, \varphi_{j+1}, \ldots, \varphi_k)$ is a leftmost derivation of $\varphi_k$ from $\varphi_0$ and has length less than
$k$, which is also impossible.

Then $|\varphi_k| > |\varphi_0|$. □

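Corollary 9.2: If $G$ is a branching ε-free UCFG and $\sigma \in L(G)$ then $\sigma$ has a leftmost
derivation of length at most $|\sigma| \cdot m$, where $m = |N| + 2$.

Proof. Let $(\varphi_0, \ldots, \varphi_k)$ be a shortest leftmost derivation of $\sigma$ from $S$. Suppose $k > |\sigma| \cdot m$.
Consider the sub-derivations

$(\varphi_0, \ldots, \varphi_m)$
$\vdots$
$(\varphi_{(|\sigma|-1) \cdot m}, \ldots, \varphi_{|\sigma| \cdot m})$
$(\varphi_{|\sigma| \cdot m}, \ldots, \varphi_k)$.

Each one except the last has $m$ steps and $m > |N| + 1$. Then by lemma,

$|\varphi_{|\sigma| \cdot m}| > |\varphi_{(|\sigma|-1) \cdot m}| > \cdots > |\varphi_m| > |\varphi_0| = 1$.

Then $|\sigma| \geq 1 + |\sigma|$, which is impossible. Hence $k \leq |\sigma| \cdot m$. □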

Lemma 9.3: $\Pi$ = SIMPLE UCFG RECOGNITION is in the computational class NP.

Proof. Let $G = (N, \Sigma, R, S)$ be a simple UCFG and $\sigma \in \Sigma^*$. Consider the following
nondeterministic algorithm with input $(G, \sigma)$:

Step 1. Write down $\varphi_0 = S$.

Step 2. Perform the following steps for $i$ from 0 to $|\sigma| \cdot m - 1$, where $m = |N| + 2$.

(a) Express $\varphi_i$ as $u_i A_i \gamma_i$ by finding the leftmost nonterminal, or loop if impossible.

(b) Guess a rule $(A_i, Y_{i,1} \ldots Y_{i,k_i}) \in R$ and a permutation $\rho_i$ of $[1, k_i]$, or loop if there is
no such rule.

(c) Write down $\varphi_{i+1} = u_i Y_{i,\rho_i(1)} \ldots Y_{i,\rho_i(k_i)} \gamma_i$.

(d) If $\varphi_{i+1} = \sigma$ then halt.

Step 3. Loop.

It should be apparent that the algorithm runs in time at worst polynomial in $\|(G, \sigma)\|$; note
that the length of $\varphi_i$ increases by at most a constant amount on each iteration.

Assume $(G, \sigma) \in \Pi$. Then $\sigma$ has a leftmost derivation of length at most $|\sigma| \cdot m$ by
Corollary 9.2; hence the nondeterministic algorithm will be able to guess it and will halt.
Conversely, suppose the algorithm halts on input $(G, \sigma)$. On the iteration when the algorithm
halts, the sequence $(\varphi_0, \ldots, \varphi_{i+1})$ will constitute a leftmost derivation of $\sigma$ from $S$; hence
$\sigma \in L(G)$ and $(G, \sigma) \in \Pi$.

Then there is a nondeterministic algorithm that runs in polynomial time and accepts exactly
$\Pi$. Hence $\Pi \in$ NP. □
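The nondeterministic algorithm in the proof can be simulated deterministically by trying every possible guess, which makes the step bound of Corollary 9.2 concrete. The following is a sketch under stated assumptions: the grammar is given as a list of `(lhs, rhs_tuple)` pairs, the function name and the tiny example grammar are hypothetical, and the length-based pruning is an addition justified by ε-freeness rather than part of the algorithm above. The simulation takes exponential time in general, as expected for an NP-complete problem.

```python
from itertools import permutations


def recognizes(rules, nonterminals, start, sigma):
    """Exhaustively simulate the guessing algorithm for a simple UCFG.
    rules: list of (lhs, rhs) with rhs a tuple of symbols."""
    sigma = tuple(sigma)
    m = len(nonterminals) + 2
    frontier = {(start,)}                    # Step 1: phi_0 = S
    for _ in range(len(sigma) * m):          # Step 2: at most |sigma| * m iterations
        next_frontier = set()
        for form in frontier:
            # (a) find the leftmost nonterminal; a form without one is a dead branch
            pos = next((i for i, s in enumerate(form) if s in nonterminals), None)
            if pos is None:
                continue
            for lhs, rhs in rules:           # (b) every rule and every permutation
                if lhs != form[pos]:
                    continue
                for order in set(permutations(rhs)):
                    new = form[:pos] + order + form[pos + 1:]   # (c)
                    if new == sigma:         # (d) halt on success
                        return True
                    # epsilon-free rules never shrink a form, so longer forms are hopeless
                    if len(new) <= len(sigma):
                        next_frontier.add(new)
        frontier = next_frontier
    return False                             # Step 3: no sequence of guesses halts


if __name__ == "__main__":
    # Hypothetical simple UCFG: S -> {A, b}, A -> a
    rules = [("S", ("A", "b")), ("A", ("a",))]
    print(recognizes(rules, {"S", "A"}, "S", "ba"))
```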

9.4. The UCFG recognition problem is NP-complete 

Lemma 9.4: Let $(V, E, k) = (V, \{e_i\}, k)$ be a possible instance of VERTEX COVER. Then
it is possible to construct, in time polynomial in $\|V\|$, $\|E\|$, and $k$, a simple UCFG $G(V, E, k)$
and a string $\sigma(V, E, k)$ such that

$(G(V, E, k), \sigma(V, E, k)) \in$ SIMPLE UCFG RECOGNITION
iff $(V, E, k) \in$ VERTEX COVER.

Proof. Construct $G(V, E, k)$ as follows. Let the set $N$ of nonterminals consist of the following
symbols not in $V$:

START, $U$, $D$,

$H_i$ for $i \in [1, |E|]$,

$U_i$ for $i \in [1, |V| - k]$,

$D_i$ for $i \in [1, |E| \cdot (k - 1)]$.

$\|N\|$ will be at worst polynomial in $\|E\|$, $\|V\|$, and $k$ for a reasonable length measure. Define
the terminal vocabulary $\Sigma$ to consist of subscripted symbols as follows:

$\Sigma = \{a_i : a \in V, i \in [1, |E|]\}$.

Designate START as the start symbol. Include the following as members of the rule set $R$:

(a) Include the rule

START $\to H_1 \ldots H_{|E|}\, U_1 \ldots U_{|V|-k}\, D_1 \ldots D_{|E| \cdot (k-1)}$.

(b) For each $e_i \in E$, include the rules

$\{H_i \to a_i : a$ an endpoint of $e_i\}$.

START $\to H_1 H_2 H_3 H_4 H_5 H_6 H_7\, U_1 U_2\, D_1 D_2 D_3 D_4 D_5 D_6 D_7 D_8 D_9 D_{10} D_{11} D_{12} D_{13} D_{14}$

$H_1 \to v_1 \mid w_1$    $H_2 \to v_2 \mid y_2$    $H_3 \to w_3 \mid x_3$
$H_4 \to w_4 \mid z_4$    $H_5 \to x_5 \mid y_5$    $H_6 \to y_6 \mid z_6$
$H_7 \to a_7 \mid z_7$, with $a$ the other endpoint of $e_7$

$U_1 \to U$    $U_2 \to U$

$U \to v_1 v_2 v_3 v_4 v_5 v_6 v_7 \mid w_1 w_2 w_3 w_4 w_5 w_6 w_7 \mid x_1 x_2 x_3 x_4 x_5 x_6 x_7$
$\quad \mid y_1 y_2 y_3 y_4 y_5 y_6 y_7 \mid z_1 z_2 z_3 z_4 z_5 z_6 z_7$

$D_i \to D$ for each $i \in [1, 14]$

$D \to v_1 \mid v_2 \mid v_3 \mid v_4 \mid v_5 \mid v_6 \mid v_7 \mid w_1 \mid w_2 \mid w_3 \mid w_4 \mid w_5 \mid w_6 \mid w_7$
$\quad \mid x_1 \mid x_2 \mid x_3 \mid x_4 \mid x_5 \mid x_6 \mid x_7 \mid y_1 \mid y_2 \mid y_3 \mid y_4 \mid y_5 \mid y_6 \mid y_7$
$\quad \mid z_1 \mid z_2 \mid z_3 \mid z_4 \mid z_5 \mid z_6 \mid z_7$

Figure 6: The construction of Lemma 9.4 produces this grammar when applied to the
VERTEX COVER problem of Figure 5. The H-symbols ensure that the solution that is
found must hit each of the edges, while the U-symbols ensure that enough elements of $V$
remain untouched to satisfy the requirement $|V'| \leq k$. The D-symbols are dummies that
absorb excess input symbols. A shorter grammar than this will suffice if the grammar is
not required to be duplicate-free.



(c) For each $i \in [1, |V| - k]$, include the rule $U_i \to U$. Also include the rules

$\{U \to a_1 \ldots a_{|E|} : a \in V\}$.

(d) For each $i \in [1, |E| \cdot (k - 1)]$, include the rule $D_i \to D$. Also include the rules

$\{D \to a : a \in \Sigma\}$.

Take $G(V, E, k)$ to be $(N, \Sigma, R, \mathrm{START})$. (Figure 6 shows the result of applying this
construction to the VERTEX COVER instance of Figure 5.)

Let $h : [1, |V|] \to V$ be some standard enumeration of the elements of $V$. Construct
$\sigma(V, E, k)$ as $h(1)_1 \ldots h(1)_{|E|} \ldots h(|V|)_1 \ldots h(|V|)_{|E|}$; thus $\sigma(V, E, k)$ will have length
$|E| \cdot |V|$.

It is easy to see that $\|(G(V, E, k), \sigma(V, E, k))\|$ will be at worst polynomial in $\|E\|$, $\|V\|$,
and $k$ for reasonable $\|\cdot\|$. It will also be possible to construct the grammar and string in
polynomial time. Finally, note that given the definition of a possible instance of VERTEX
COVER, the grammar will be branching, ε-free, and duplicate-free, hence simple.
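The construction of the grammar and string described so far can be sketched as a short program. The representation is an assumption made for illustration: rules are `(lhs, rhs_tuple)` pairs, the subscripted terminal $a_i$ is encoded as the string `"a_i"`, and vertex names are assumed not to clash with the nonterminal names. The example instance is hypothetical and is not the one in Figure 5.

```python
def build_reduction(vertices, edges, k):
    """Build (rules, start_symbol, sigma) for a VERTEX COVER instance.
    vertices: list of names in a fixed enumeration order; edges: list of pairs."""
    n_e = len(edges)

    def term(a, i):
        return f"{a}_{i}"                      # the subscripted terminal a_i

    H = [f"H{i}" for i in range(1, n_e + 1)]
    U = [f"U{i}" for i in range(1, len(vertices) - k + 1)]
    D = [f"D{i}" for i in range(1, n_e * (k - 1) + 1)]

    rules = [("START", tuple(H + U + D))]      # rule (a)
    for i, (u, v) in enumerate(edges, start=1):
        rules.append((f"H{i}", (term(u, i),)))  # rules (b): one per endpoint
        rules.append((f"H{i}", (term(v, i),)))
    for u_i in U:                               # rules (c)
        rules.append((u_i, ("U",)))
    for a in vertices:
        rules.append(("U", tuple(term(a, i) for i in range(1, n_e + 1))))
    for d_i in D:                               # rules (d)
        rules.append((d_i, ("D",)))
    for a in vertices:
        for i in range(1, n_e + 1):
            rules.append(("D", (term(a, i),)))

    # sigma lists every vertex with every edge subscript, in enumeration order
    sigma = tuple(term(a, i) for a in vertices for i in range(1, n_e + 1))
    return rules, "START", sigma


if __name__ == "__main__":
    # Hypothetical instance: path a-b-c with k = 1
    rules, start, sigma = build_reduction(["a", "b", "c"], [("a", "b"), ("b", "c")], 1)
    print(rules[0])
    print(sigma)
```

For an instance with $|V| = 5$, $|E| = 7$, and $k = 3$, the same arithmetic gives a START rule with $7 + 2 + 14 = 23$ symbols and a string of length 35.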

Now suppose $(V, E, k) \in$ VERTEX COVER. Then there exist $V' \subseteq V$ and $f : E \to V'$ such
that $|V'| \leq k$ and for every $e \in E$, $f(e)$ is an endpoint of $e$. $E$ is nonempty by hypothesis
and $V'$ must hit every edge, hence $|V'|$ cannot be zero. Construct a parse tree for $\sigma(V, E, k)$
according to $G(V, E, k)$ as follows.

Step 1. Number the elements of $V - V'$ as $\{x_i : i \in [1, |V - V'|]\}$. For each $x_i$ where
$i \leq |V| - k$, construct a node dominating the substring $(x_i)_1 \ldots (x_i)_{|E|}$ of $\sigma(V, E, k)$ and
label it $U$. Then construct a node dominating only the $U$-node and label it $U_i$. Note that
the available symbols $U_i$ are numbered from 1 to $|V| - k$, so it is impossible to run out of
U-symbols. Also, $|V'| \leq k$ and $V' \subseteq V$, hence $|V - V'| \geq |V| - k$, so all of the U-symbols
will be used. Finally, note that $U \to a_1 \ldots a_{|E|}$ is a rule for any $a \in V$ and that $U_i \to U$ is
a rule for any $U_i$.

Step 2. For each $e_i \in E$, construct a node dominating the (unique) occurrence of $f(e_i)_i$ in
$\sigma(V, E, k)$ and label it $H_i$. Step 2 cannot conflict with step 1 because $f(e_i) \in V'$, hence
$f(e_i) \notin V - V'$. Different parts of step 2 cannot conflict with each other because each one
affects a symbol with a different subscript. Also note that $f(e_i)$ is an endpoint of $e_i$ and
that $H_i \to a_i$ is a rule for any $e_i \in E$ and $a$ an endpoint of $e_i$.

Step 3. Number all occurrences of terminals in $\sigma(V, E, k)$ that were not attached in step 1
or step 2. For the $i$th such occurrence, construct a node dominating the occurrence and
label it $D$. Then construct another node dominating the $D$-node and label it $D_i$. Note
that the stock of D-symbols runs from 1 to $(k - 1) \cdot |E|$. Exactly $(|V| - k) \cdot |E|$ symbols
of $\sigma(V, E, k)$ were accounted for in step 1. Also, exactly $|E|$ symbols were accounted for in
step 2. The length of $\sigma(V, E, k)$ is $|V| \cdot |E|$, hence exactly

$|V| \cdot |E| - (|V| - k) \cdot |E| - |E| = |V| \cdot |E| - |V| \cdot |E| + k \cdot |E| - |E| = (k - 1) \cdot |E|$

symbols remain at the beginning of step 3. $D \to a$ is a rule for any $a \in \Sigma$; $D_i \to D$ is a
rule for any $D_i$.

Step 4. Finally, construct a node labeled START that dominates all of the $H_i$, $U_i$, and $D_i$
nodes constructed in steps 1, 2, and 3. The rule

START $\to H_1 \ldots H_{|E|}\, U_1 \ldots U_{|V|-k}\, D_1 \ldots D_{|E| \cdot (k-1)}$

is in the grammar. Note also that nodes labeled $H_1, \ldots, H_{|E|}$ were constructed in step 2,
nodes labeled $U_1, \ldots, U_{|V|-k}$ were constructed in step 1, and nodes labeled $D_1, \ldots, D_{|E| \cdot (k-1)}$
were constructed in step 3. Hence the application of the rule is in accord with the grammar.

The START node dominates the following nodes over the terminal string:

$v_1 \ldots v_7$: $H_1\ H_2\ D_1\ D_2\ D_3\ D_4\ D_5$
$w_1 \ldots w_7$: a single $U$ node, itself under a $U_i$ node
$x_1 \ldots x_7$: $D_6\ D_7\ H_3\ D_8\ H_5\ D_9\ D_{10}$
$y_1 \ldots y_7$: a single $U$ node, itself under a $U_i$ node
$z_1 \ldots z_7$: $D_{11}\ D_{12}\ D_{13}\ H_4\ D_{14}\ H_6\ H_7$

(Each $D_i$ node dominates a $D$ node, which in turn dominates the terminal.)

Figure 7: This parse tree shows how the grammar shown in Figure 6 can generate the string
$\sigma(V, E, k)$ constructed in Lemma 9.4 for the VERTEX COVER problem of Figure 5. The
corresponding VERTEX COVER solution $V' = \{v, x, z\}$ and its intersection with the edges
can be read off by noticing which terminals the H-symbols dominate.

Then $\sigma(V, E, k) \in L(G)$. (Figure 7 illustrates the application of this parse-tree construction
procedure to the grammar and input string derived from the VERTEX COVER example
in Figure 5.)

Conversely, suppose $\sigma(V, E, k) \in L(G)$. Then the derivation of $\sigma(V, E, k)$ from START
must begin with the application of the rule

START $\to H_1 \ldots H_{|E|}\, U_1 \ldots U_{|V|-k}\, D_1 \ldots D_{|E| \cdot (k-1)}$

and each $H_i$ must later be expanded as some subscripted terminal $g(H_i)$. Define $f(e_i)$ to
be $g(H_i)$ without the subscript; then by construction of the grammar, $f(e_i)$ is an endpoint
of $e_i$ for all $e_i \in E$. Define $V' = \{f(e_i) : e_i \in E\}$; then it is apparent that $V' \subseteq V$ and that
$V'$ contains at least one endpoint of $e_i$ for all $e_i \in E$. Also, each $U_i$ for $i \in [1, |V| - k]$
must be expanded as $U$, then as some substring $(a_i)_1 \ldots (a_i)_{|E|}$ of $\sigma(V, E, k)$. 16 Since the
substrings dominated by the $H_i$ and $U_i$ must all be disjoint, and since there are only $|E|$
subscripted occurrences of any single symbol from $V$ in $\sigma(V, E, k)$, there must be $|V| - k$
distinct elements of $V$ that are not dominated in any of their subscripted versions by any
$H_i$. Then $|V - V'| \geq |V| - k$. Since in addition $V' \subseteq V$, $|V'| \leq k$. Then $(V, E, k) \in$
VERTEX COVER. □

Theorem 1: SIMPLE UCFG RECOGNITION is NP-complete. 

Proof. SIMPLE UCFG RECOGNITION is in the class NP by Lemma 9.3, hence a
polynomial-time reduction of VERTEX COVER to SIMPLE UCFG RECOGNITION is
sufficient. Let $(V, E, k)$ be a possible instance of VERTEX COVER. Let $G$ be $G(V, E, k)$ and $\sigma$
be $\sigma(V, E, k)$ as constructed in Lemma 9.4. Note that $G$ is simple.

The construction of $G$ and $\sigma$ can, by lemma, be carried out in time at worst polynomial
in $\|E\|$, $\|V\|$, and $k$. Also by lemma, $(G, \sigma) \in$ SIMPLE UCFG RECOGNITION
iff $(V, E, k) \in$ VERTEX COVER. $k$ is not polynomial in $\|k\|$ under a reasonable encoding
scheme. However, $|E| > k$, hence $\|E\| \geq \|k\|$; also $\|(V, E, k)\| \geq \|E\|$, hence $\|(V, E, k)\| \geq k$,
all by properties assumed to hold of $\|\cdot\|$. Then $G$ and $\sigma$ can in fact be constructed in time
at worst polynomial in $\|(V, E, k)\|$.

Hence the VERTEX COVER problem is polynomial-time reduced to SIMPLE UCFG
RECOGNITION. □






16 The grammar would allow the substring $(a_i)_1 \ldots (a_i)_{|E|}$ to appear in any permutation, but in $\sigma(V, E, k)$
it appears only in the indicated order.



10. References 

Barton, E. (1984). "Toward a Principle-Based Parser," A.I. Memo No. 788, M.I.T. Artificial 
Intelligence Laboratory, Cambridge, Mass. 

Berwick, R. (1982). "Computational Complexity and Lexical-Functional Grammar," American
Journal of Computational Linguistics 8.3-4:97-109.

Berwick, R., and A. Weinberg (1984). The Grammatical Basis of Linguistic Performance. 
Cambridge, Mass.: M.I.T. Press. 

Chomsky, N. (1980). Rules and Representations. New York: Columbia University Press.

Chomsky, N. (1981). Lectures on Government and Binding. Dordrecht, Holland: Foris 
Publications. 

Earley, J. (1970). "An Efficient Context-Free Parsing Algorithm," Comm. ACM 13.2:94-102.

Garey, M., and D. Johnson (1979). Computers and Intractability. San Francisco: W. H. Freeman
and Co.

Hopcroft, J., and J. Ullman (1979). Introduction to Automata Theory, Languages, and
Computation. Reading, Massachusetts: Addison-Wesley.

Miller, G., and N. Chomsky (1963). "Finitary Models of Language Users," in R. D. Luce, R.
R. Bush, and E. Galanter, eds., Handbook of Mathematical Psychology, vol. II, 419-492.
New York: John Wiley and Sons, Inc.

Shieber, S. (1983). "Direct Parsing of ID/LP Grammars." Technical Report 291R, SRI 
International, Menlo Park, California. Also appears in Linguistics and Philosophy 7:2. 






